Further Optimal Regret Bounds for Thompson Sampling

Authors

  • Shipra Agrawal
  • Navin Goyal
Abstract

The second-to-last inequality follows from the observation that the event $E^\mu_i(t)$ was defined as $\hat{\mu}_i(t) > x_i$. At time $\tau_{k+1}$ for $k \ge 1$, $\hat{\mu}_i(\tau_{k+1}) = \frac{S_i(\tau_{k+1})}{k+1} \le \frac{S_i(\tau_{k+1})}{k}$, where the latter is simply the average of the outcomes observed from $k$ i.i.d. plays of arm $i$, each of which is a Bernoulli trial with mean $\mu_i$. Using Chernoff–Hoeffding bounds (Fact 1), we obtain that $\Pr(\hat{\mu}_i(\tau_{k+1}) > x_i) \le \Pr\left(\frac{S_i(\tau_{k+1})}{k} > x_i\right) \le e^{-k\, d(x_i, \mu_i)}$, where $d(\cdot,\cdot)$ denotes the Kullback–Leibler divergence between Bernoulli distributions.
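As a quick numerical sanity check of the Chernoff–Hoeffding bound used here, the sketch below compares the empirical tail probability of the sample mean of $k$ Bernoulli trials against $e^{-k\,d(x,\mu)}$, where $d(x,\mu) = x\log\frac{x}{\mu} + (1-x)\log\frac{1-x}{1-\mu}$ is the Bernoulli KL divergence. The specific values of $\mu$, $x$, and $k$ are illustrative choices, not from the paper.

```python
import math
import random

def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def empirical_tail(mu, x, k, trials=20_000, seed=0):
    """Estimate Pr(sample mean of k Bernoulli(mu) draws exceeds x)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < mu for _ in range(k))  # number of successes
        if s / k > x:
            hits += 1
    return hits / trials

mu, x, k = 0.5, 0.7, 50
bound = math.exp(-k * kl_bernoulli(x, mu))
est = empirical_tail(mu, x, k)
print(f"empirical tail ~ {est:.5f}, Chernoff-Hoeffding bound = {bound:.5f}")
```

For these values the bound is roughly 0.016, while the true tail probability is much smaller; the bound is loose but, as the proof requires, it decays exponentially in $k$.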

Related papers

Near-optimal Regret Bounds for Thompson Sampling

Thompson Sampling (TS) is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to state-of-the-art methods. In this paper, a novel and almost tight martingale-based regret analysis for Thompson ...

Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Git...


Analysis of Thompson Sampling for the Multi-armed Bandit Problem

The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to pla...
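The Beta-Bernoulli form of the algorithm described in that abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation; the arm means, horizon, and uniform Beta(1,1) priors are assumptions for the example.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns per-arm play counts."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # Beta(1,1) uniform prior on each arm
    failures = [0] * n_arms
    plays = [0] * n_arms
    for _ in range(horizon):
        # Sample a mean estimate from each arm's posterior and play the argmax.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = rng.random() < true_means[arm]  # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1 - reward
        plays[arm] += 1
    return plays

plays = thompson_sampling([0.2, 0.5, 0.8], horizon=5000)
print(plays)  # the 0.8 arm should receive most of the plays
```

As the posteriors concentrate, the suboptimal arms are sampled above the best arm's mean increasingly rarely, which is exactly the effect the regret analyses above quantify.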


An Information-Theoretic Analysis of Thompson Sampling

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and ...



Publication date: 2013